Pitchfork is a highly influential, industry-leading online music publication producing original album reviews in addition to features, news articles, and more. Pitchfork's album review ratings system assigns a score from 0 to 10 based on internal polling. A review score can have a significant effect on an artist's career, affecting sales and marketing, live touring and festival opportunities, licensing placements and fees, digital streaming playlists and radio pickup, physical distribution orders, future partnership opportunities and terms, and more. Generally, a score of 7.0 or above is considered good, while 8.0 or above is considered excellent. Conversely, a score below 6.5 is considered poor, while a score below 6.0 is considered bad.
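These informal bands can be expressed as a small helper function (the name `rating_band` is my own, and the unnamed middle band between 6.5 and 7.0 is labeled "average" here by assumption):

```python
def rating_band(score: float) -> str:
    """Map a Pitchfork score (0-10) to the informal bands described above."""
    if score >= 8.0:
        return "excellent"
    if score >= 7.0:
        return "good"
    if score >= 6.5:
        return "average"   # the unnamed band between "poor" and "good"
    if score >= 6.0:
        return "poor"
    return "bad"
```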
This project analyzes 6000 reviews collected between December 1, 2017 and April 27, 2023 by scraping the Pitchfork website, and asks the following questions:
The inspiration for this project comes from my experience as a long-time Pitchfork reader: I've frequently read reviews that didn't seem to match the rating the album received, most often albums with low ratings receiving what I perceived to be favorable reviews, and (less frequently) vice versa. I've also wondered whether a particular writer or label had an effect on the rating.
Data was filtered to focus on new reviews, removing 336 reissue reviews from the dataset, as the ratings distribution of reissues skews towards higher ratings: the average rating amongst reissues between 2017 and 2023 is 8.6, whereas the average rating amongst new releases during the same period is 7.2.
VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis scored 45.8% of reviews as positive or very positive, 42% as average, and 12.2% as negative or very negative (only 0.2% rated "very negative"). There is no observable relationship between sentiment score and Pitchfork rating: the average rating of all new reviews, 7.2, was consistent across all sentiment categories, and several albums that received a rating of 10 had reviews the VADER algorithm scored as "average" or "negative". Attempts to improve the algorithm's accuracy proved fruitless.
Network Analysis of Pitchfork writers with respect to genres covered identified the writers who are potentially the most influential and connected, as measured by degree centrality. These are not the writers who have written the most reviews. Other centrality measures were examined, but were highly collinear. Amongst writers with at least 10 reviews, none of the writers with the ten highest average ratings were among the ten highest in degree centrality. No correlation between centrality measures and ratings was observed.
Network Analysis of record labels with respect to genres proved fruitless, as nearly 69% of record labels had albums reviewed in only a single genre. However, among labels with at least ten reviews, the ten labels with the highest average ratings had a mean of 5 genres, indicating that labels which release records in more than one genre tend to receive better ratings. A network analysis of labels with at least ten reviews identified the most influential and connected labels within the network, but no observable relationship exists between centrality measures and average Pitchfork rating.
Scatterplots and correlation matrices of the data indicated no correlation between the dependent variable (rating) and any of the independent variables collected. Predictive modeling of the dataset was unsuccessful.
A dataset of 6000 reviews was acquired by scraping Pitchfork's website using original Python scripts. The following data fields were extracted from the Pitchfork website:
Web scraping code can be found in Appendix A. One note: from time to time, Pitchfork produces multi-album reviews of a single artist. An example can be found here. These multi-album reviews typically cover older releases and are featured as a carousel selection on the Pitchfork website. As a result, they require modifications to the web scraper to retrieve the review text in the same format as single-album reviews, which proved challenging. Since this analysis focuses on new reviews, the multi-album reviews were excluded from the data before saving to disk.
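As a rough illustration of the extraction step only (the markup below is hypothetical, not Pitchfork's actual HTML, and the real scraper lives in Appendix A):

```python
import re

# hypothetical review-page snippet -- real Pitchfork markup differs
html = '''
<h1 class="artist">The National</h1>
<h2 class="album">First Two Pages of Frankenstein</h2>
<span class="score">6.6</span>
'''

def extract_fields(page: str) -> dict:
    """Pull artist, album, and rating out of a (hypothetical) review page."""
    grab = lambda cls: re.search(r'class="%s">([^<]+)<' % cls, page).group(1)
    return {
        "artist": grab("artist"),
        "album": grab("album"),
        "rating": float(grab("score")),
    }

extract_fields(html)
```

A production scraper would use a proper HTML parser rather than regexes; this sketch only shows the field-to-column mapping.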
# previewing the full dataset
data.head()
|  | artist | album | genre | label | releaseYear | writer | published | bestNewMusic | bestNewReissue | rating | review | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The National | First Two Pages of Frankenstein | Rock | 4AD | 2023 | Brad Shoup | 2023-04-27 | False | False | 6.6 | ’s ninth album, First Two Pages of Frankenstei... | https://pitchfork.com/reviews/albums/the-natio... |
| 1 | Bill Evans | Treasures: Solo, Trio and Orchestra Recordings... | Jazz | Elemental Music | 2023 | Mark Richardson | 2023-04-27 | False | False | 8.1 | If’is the most common entry point for a new ja... | https://pitchfork.com/reviews/albums/bill-evan... |
| 2 | Remi Kabaka | Son of Africa | Rock | BBE Music | 2023 | Dean Van Nguyen | 2023-04-27 | False | False | 7.4 | Musician, a Ghanaian-born Nigerian, first land... | https://pitchfork.com/reviews/albums/remi-kaba... |
| 3 | Portrayal of Guilt | Devil Music | Rock | Run for Cover | 2023 | Marty Sartini Garner | 2023-04-27 | False | False | 7.4 | There are so many ways music can make people u... | https://pitchfork.com/reviews/albums/portrayal... |
| 4 | Jim Legxacy | HNPM | Rap / Experimental | (!) | 2023 | Dylan Green | 2023-04-26 | False | False | 7.8 | Rapper-producer opens “miley’s riddim,” the th... | https://pitchfork.com/reviews/albums/jim-legxa... |
The following steps were taken to clean the data:
- `review` column
- `author` column is renamed `writer`
- `score` column is renamed `rating`
- `releaseYear` column is converted to string and stripped of its float decimal
- the data is split into two dataframes: `data_new` (new albums) and `data_reissues` (reissue albums)

dr = data_reissues.rating.describe().round(1).reset_index().rename(columns={'rating':'reissues'})
dn = data_new.rating.describe().round(1).reset_index().rename(columns={'rating':'new albums'})
print(tabulate(pd.merge(dr,dn,on='index'), headers="keys", tablefmt="github", showindex=False, numalign="right"))
| index | reissues | new albums |
|---------|------------|--------------|
| count | 336 | 5301 |
| mean | 8.6 | 7.2 |
| std | 0.8 | 0.8 |
| min | 5.6 | 1.6 |
| 25% | 8.1 | 6.8 |
| 50% | 8.7 | 7.3 |
| 75% | 9.2 | 7.7 |
| max | 10 | 10 |
fig, axes = plt.subplots(1, 2, figsize=(21, 6), layout='constrained')
# reissues
sns.boxplot(ax=axes[0], x=data_reissues.rating)
axes[0].title.set_text('reissues')
# new albums
sns.boxplot(ax=axes[1], x=data_new.rating)
axes[1].title.set_text('new albums')
for i in range(len(axes)):
axes[i].set_xlim(0,10)
Both datasets have a similar standard deviation, although the mean rating, as noted in the findings, is substantially higher amongst reissues than new albums. Our new-albums dataset now includes 5301 observations. We also see there are a lot of outliers in the ratings of new albums below 5.5. Let's take a look at the distribution of ratings over time to see if there are any noticeable trends.
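As an aside, the boxplot flags outliers with the usual Tukey 1.5×IQR rule; plugging in the new-album quartiles from the summary table shows why the flagged points sit below roughly 5.5:

```python
# Tukey's rule: points below Q1 - 1.5*IQR are drawn as boxplot outliers
q1, q3 = 6.8, 7.7            # 25th and 75th percentiles of new-album ratings
iqr = q3 - q1                # interquartile range
lower_fence = q1 - 1.5 * iqr # ratings below this are plotted as outliers
print(round(lower_fence, 2))
```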
fig, axes = plt.subplots(1, 3, figsize=(21, 5), layout='constrained')
# daily
sns.lineplot(ax=axes[0], data=daily_ratings, x='published', y='avg_rating')
axes[0].title.set_text('daily')
# weekly
sns.lineplot(ax=axes[1], data=weekly_ratings, x='published', y='avg_rating')
axes[1].title.set_text('weekly')
# monthly
sns.lineplot(ax=axes[2], data=monthly_ratings, x='published', y='avg_rating')
axes[2].title.set_text('monthly')
ylab = ['avg_rating','','']
for i in range(len(axes)):
axes[i].set_xlabel('')
axes[i].set_ylabel(ylab[i])
axes[i].set_ylim(0,10)
axes[i].axhline(y=np.mean(data.rating), color='r');
plt.show();
Here, ratings are grouped into three buckets: average rating by day, week, and month. The red line represents the mean rating. We can observe that ratings fluctuate most by day and even out over longer windows. Nevertheless, on a monthly basis, average ratings were very low at the end of 2017 and beginning of 2018, and swung from low to high in the fourth quarter of 2022. The daily ratings give us another sense of how the lower scores may be influencing the overall ratings.
data_new.describe(include="all")
|  | artist | album | genre | label | releaseYear | writer | published | bestNewMusic | bestNewReissue | rating | review | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5301 | 5301 | 5301 | 5301 | 5301 | 5301 | 5301 | 5301 | 5301 | 5301.000000 | 5301 | 5301 |
| unique | 3687 | 5248 | 91 | 1964 | 7 | 343 | 1581 | 2 | 1 | NaN | 5301 | 5301 |
| top | Various Artists | DJ-Kicks | Rock | self-released | 2018 | Philip Sherburne | 2019-04-11 | False | False | NaN | ’s ninth album, First Two Pages of Frankenstei... | https://pitchfork.com/reviews/albums/the-natio... |
| freq | 13 | 8 | 1567 | 175 | 1072 | 244 | 6 | 5068 | 5301 | NaN | 1 | 1 |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.182079 | NaN | NaN |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.824410 | NaN | NaN |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.600000 | NaN | NaN |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.800000 | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.300000 | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.700000 | NaN | NaN |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 10.000000 | NaN | NaN |
Let's take a look at writers, the number of reviews per writer, and the distribution of average ratings amongst those writers.
writers_df.describe()
|  | total_reviews | avg_rating | genres | pct_total |
|---|---|---|---|---|
| count | 343.000000 | 343.000000 | 343.000000 | 343.000000 |
| mean | 15.454810 | 7.227959 | 3.886297 | 0.002924 |
| std | 26.458897 | 0.475075 | 3.529560 | 0.004987 |
| min | 1.000000 | 5.400000 | 1.000000 | 0.000200 |
| 25% | 2.000000 | 6.995000 | 1.000000 | 0.000400 |
| 50% | 5.000000 | 7.220000 | 3.000000 | 0.000900 |
| 75% | 16.000000 | 7.500000 | 5.000000 | 0.003000 |
| max | 244.000000 | 9.300000 | 23.000000 | 0.046000 |
343 writers have written reviews over this time frame. The average writer has written 15 reviews, but the median number of reviews is 5.
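The gap between the mean (15) and the median (5) is the usual signature of a right-skewed count distribution, where a few very prolific writers pull the mean up. A toy illustration with made-up review counts:

```python
from statistics import mean, median

# hypothetical review counts: many occasional writers, a few prolific ones
counts = [1, 1, 2, 3, 5, 5, 8, 20, 60, 244]

# the prolific outliers inflate the mean but barely move the median
print(mean(counts), median(counts))
```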
writers_df.sort_values(by='total_reviews', ascending=False).round(5).head(10)
|  | writer | total_reviews | avg_rating | genres | pct_total |
|---|---|---|---|---|---|
| 0 | Philip Sherburne | 244 | 7.53 | 18 | 0.0460 |
| 1 | Ian Cohen | 146 | 6.97 | 10 | 0.0275 |
| 2 | Stuart Berman | 131 | 7.48 | 14 | 0.0247 |
| 3 | Ben Cardew | 123 | 7.10 | 23 | 0.0232 |
| 4 | Alphonse Pierre | 121 | 6.59 | 4 | 0.0228 |
| 5 | Evan Rytlewski | 114 | 6.75 | 14 | 0.0215 |
| 6 | Sheldon Pearce | 112 | 7.03 | 9 | 0.0211 |
| 7 | Andy Beta | 104 | 7.25 | 16 | 0.0196 |
| 8 | Sam Sodomsky | 101 | 7.60 | 7 | 0.0191 |
| 9 | Grayson Haver Currin | 99 | 7.40 | 19 | 0.0187 |
sns.scatterplot(data=writers_df, x='total_reviews', y='avg_rating')
plt.axhline(y=np.mean(data_new.rating), color='r');
We can see that as a writer writes more reviews, their average rating gets closer to the overall mean. This is largely because a single low rating has an outsized effect on the average of a writer with only a few reviews, while it barely moves the average of a writer with many reviews under their belt.
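This shrinkage toward the mean is plain sampling variability: the spread of an average of n draws shrinks like 1/√n. A quick simulation (hypothetical ratings drawn from a normal distribution with the dataset's mean of 7.2 and standard deviation of 0.8):

```python
import random
from statistics import mean, stdev

random.seed(42)  # reproducibility

def avg_rating(n):
    """Average of n simulated ratings ~ Normal(7.2, 0.8)."""
    return mean(random.gauss(7.2, 0.8) for _ in range(n))

# spread of simulated writer averages at 2 reviews vs 50 reviews
spread_small = stdev(avg_rating(2) for _ in range(500))
spread_large = stdev(avg_rating(50) for _ in range(500))
print(spread_small, spread_large)  # small-n averages vary far more
```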
# writers with one review vs all other writers
fig, axes = plt.subplots(1, 2, figsize=(21, 6), layout='constrained')
# writers with one review
sns.histplot(ax=axes[0], data=writers_df[writers_df['total_reviews'] == 1], x='avg_rating')
axes[0].axvline(data_new.rating.mean(), c='r', lw=1.5)
axes[0].title.set_text('one review')
# all other writers
sns.histplot(ax=axes[1], data=writers_df[writers_df['total_reviews'] != 1], x='avg_rating')
axes[1].axvline(data_new.rating.mean(), c='r', lw=1.5)
axes[1].title.set_text('all others')
print('Total writers with one review: {}'.format(writers_df[writers_df['total_reviews'] == 1].writer.nunique()))
one_review = writers_df[writers_df['total_reviews'] == 1]
all_others = writers_df[writers_df['total_reviews'] != 1]
# .mean() returns a plain float here, which has no .round() method -- use round() instead
one_mean = round(one_review.avg_rating.mean(), 1)
others_mean = round(all_others.avg_rating.mean(), 1)
We can see that the distribution of ratings for writers with one review tends to skew towards lower ratings.
print("Total Labels: {}".format(labels_df.label.nunique()))
print("Total Genres: {}".format(data_new.genre.nunique()))
x = labels_genre.label.value_counts().reset_index().label
# pass x as a keyword argument to avoid seaborn's positional-argument deprecation
sns.countplot(x=x)
plt.xlabel('number of genres')
plt.ylabel('number of labels')
plt.title('number of genres per label')
plt.plot();
Total Labels: 1964
Total Genres: 91
one_genre = labels_df[labels_df['total_genres'] == 1].label.count()
total_pct = round(one_genre / labels_df.label.nunique() * 100, 1)
print('Total number of labels with reviews in one genre: {}'.format(one_genre))
print('Percentage of labels with reviews in one genre: {}%'.format(total_pct))
Total number of labels with reviews in one genre: 1613
Percentage of labels with reviews in one genre: 82.1%
sns.scatterplot(data=labels_df, x='total_genres', y='avg_rating')
plt.axhline(y=np.mean(data_new.rating), color='r');
It appears that the more genres a label releases records in, the closer its average rating is to the mean.
In this section, I take two approaches to try to gain better insight into the contents of the reviews and how they may relate to Pitchfork ratings: sentiment analysis using the VADER algorithm, and a lexical analysis.
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool included in the Natural Language Toolkit (NLTK) library in Python. VADER uses a lexicon (a list of words and their associated sentiment scores) to analyze the sentiment of a given text. The lexicon contains over 7,500 words and their associated positive or negative sentiment scores, as well as intensity modifiers that adjust the strength of the sentiment.
VADER outputs sentiment scores for the text on four dimensions: positive, negative, neutral, and compound. The positive, negative, and neutral scores represent the proportion of words in the text that are classified as positive, negative, or neutral. The compound score is a single number between -1 and 1 that represents the overall sentiment of the text, with -1 indicating extremely negative sentiment, 0 indicating neutral sentiment, and 1 indicating extremely positive sentiment.
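For reference, the compound score in the reference VADER implementation is the sum of the lexicon valences found in the text, squashed into (-1, 1) by the normalization x/√(x² + α) with α = 15. A sketch of just that normalization step (not the full algorithm, which also applies intensity modifiers and other heuristics):

```python
import math

def normalize(valence_sum: float, alpha: float = 15) -> float:
    """Squash a summed valence score into (-1, 1), as VADER's compound score does."""
    return valence_sum / math.sqrt(valence_sum**2 + alpha)

# a strongly positive valence sum maps close to 1; a neutral sum stays at 0
print(normalize(4.0), normalize(0.0), normalize(-4.0))
```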
The downside to the VADER algorithm is that it is designed specifically for analyzing sentiment in social media texts. However, we can apply it to the reviews by splitting the text of each review into individual sentences and combining the per-sentence scores.
We start by defining a function to convert each score to a sentiment category.
# function to rank each review
def rank(score):
'''provides a ranking based on a score'''
if score <= -.32:
return "very negative"
if score <= -0.0:
return "negative"
if score <= .16:
return "average"
if score <= .32:
return "positive"
if score <= 1:
return "very positive"
Next, we instantiate the SentimentIntensityAnalyzer object, and define a function to read through the reviews and assign scores and sentiments using the rank function we defined above.
sia = SentimentIntensityAnalyzer()
def sentence_score(reviews):
    '''compiles an aggregate score based on the sentences of a review'''
    sentence_scores = []
    sentence_sentiment = []
    for review in reviews:
        # average the compound score across every sentence in the review
        scores = [sia.polarity_scores(sentence)["compound"] for sentence in nltk.sent_tokenize(review)]
        score = round(np.mean(scores), 3)
        sentence_scores.append(score)
        sentence_sentiment.append(rank(score))
    return sentence_scores, sentence_sentiment
%time data_new['score'], data_new['sentiment'] = sentence_score(data_new.review)
fig, axes = plt.subplots(1, 3, figsize=(21, 5), layout='constrained')
# rating
sns.scatterplot(ax=axes[0], data=data_new, x='score', y='rating')
axes[0].title.set_text('rating ~ score')
# score
sns.histplot(ax=axes[1], x=data_new.score)
axes[1].axvline(data_new.score.mean(), c='r', lw=0.8)
axes[1].title.set_text('score')
# sentiment
sns.countplot(ax=axes[2], x=data_new.sentiment, order=['very negative','negative','average','positive','very positive'])
axes[2].title.set_text('sentiment')
ylab = ['','','']
for i in range(len(axes)):
axes[i].set_xlabel('')
axes[i].set_ylabel(ylab[i]);
We see a few things here:
p = sns.FacetGrid(data_new, col="sentiment", height=3, aspect=1)
p.map(sns.histplot, "rating");
We can see that the distributions of ratings amongst the different sentiment categories are all still centered around the mean of 7.2, and indeed this is the case.
print(tabulate(data_new.groupby('sentiment')['rating'].mean().round(2).reset_index().sort_values(by='rating', ascending=False), headers=['sentiment','average rating'], tablefmt="github", showindex=False, numalign="right"))
| sentiment | average rating |
|---------------|------------------|
| positive | 7.21 |
| very positive | 7.2 |
| negative | 7.19 |
| very negative | 7.17 |
| average | 7.15 |
In this case I include the average rating rounded to two decimal places so you can see that there is slight variation in the average ratings, and that the lowest average rating is amongst the "average" sentiment.
We can calculate and store the average score by writer to observe which writers write more positive reviews than others, and any relationship that might exist between those scores and other measures such as rating.
title = 'Top 10 Highest Average Sentiment Scores by Writer \n'
print(title)
print(tabulate(data_new.groupby('writer')['score'].mean().round(4).reset_index().sort_values(by='score', ascending=False).head(10), headers='keys', tablefmt="github", showindex=False))
Top 10 Highest Average Sentiment Scores by Writer

| writer | score |
|--------------------|---------|
| Nazuk Kochhar | 0.461 |
| Frankie Caracciolo | 0.423 |
| Dash Lewis | 0.419 |
| Lawrence Burney | 0.391 |
| Brandon Stosuy | 0.351 |
| Maria Barrios | 0.347 |
| Shamira Ibrahim | 0.327 |
| Jesse Weiss | 0.316 |
| Tyra Nicole Triche | 0.304 |
| Chris L. Terry | 0.304 |
# store average sentiment score by writer
writer_scores = data_new.groupby('writer')['score'].mean().round(4).reset_index().rename(columns={'score':'avg_score'})
As sentiment analysis using the VADER algorithm failed to yield results connecting sentiment and Pitchfork ratings, I next looked at which words appeared most frequently in different categories of reviews, to determine whether a lexical approach might yield quantifiable metrics that could be connected with ratings. To do this, I took the following steps:
The goal here was to identify if there were any clear distinctions of words that are used more frequently amongst albums that received higher or lower Pitchfork ratings.
First, I created a couple functions to deal with the review text.
# create a corpus of reviews
def create_corpus(dfcolumn):
'''takes a dataframe and returns a single string of its contents'''
corpus = ''
for row in dfcolumn:
corpus += row.lower()
return corpus
# define a function to look through the corpus and return a frequency distribution object
def review_words(column, dropwords=[]):
'''takes a dataframe column and an optional list of dropwords and returns a frequency distribution object of its words'''
corpus = create_corpus(column)
# tokenize the words
words = nltk.word_tokenize(corpus)
# retain only alphanumeric words
words = [w for w in words if w.isalpha()]
# remove stopwords
stopwords = nltk.corpus.stopwords.words("english")
words = [w for w in words if w.lower() not in stopwords]
# Lemmatizing words
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words]
# remove dropwords
words = [word for word in words if word not in dropwords]
# create frequency distribution
fd = nltk.FreqDist(words)
return fd
Now I look at the entire corpus and split the frequency distributions into groups of 25 for better visualization.
fd = review_words(data_new.review)
top25 = pd.DataFrame(fd.most_common(25), columns=['word','count'])
top50 = pd.DataFrame(fd.most_common()[25:50], columns=['word','count'])
top75 = pd.DataFrame(fd.most_common()[50:75], columns=['word','count'])
top100 = pd.DataFrame(fd.most_common()[75:100], columns=['word','count'])
fig, axes = plt.subplots(2, 2, figsize=(21, 12), layout='constrained')
# 1-25
sns.barplot(ax=axes[0,0], data=top25, y='word',x='count')
axes[0,0].title.set_text('Top Words: 1-25')
# 25-50
sns.barplot(ax=axes[0,1], data=top50, y='word',x='count')
axes[0,1].title.set_text('Top Words: 26-50')
# 50-75
sns.barplot(ax=axes[1,0], data=top75, y='word',x='count')
axes[1,0].title.set_text('Top Words: 51-75')
# 75-100
sns.barplot(ax=axes[1,1], data=top100, y='word',x='count')
axes[1,1].title.set_text('Top Words: 76-100')
ax_ct = [(0,0),(0,1),(1,0),(1,1)]
for i in ax_ct:
axes[i].set_xlabel('')
axes[i].set_ylabel('');
As we can see, there are a lot of generic words here that don't tell you much of anything related to music. Let's keep the musical words from this list and remove the rest.
print(list(dict(fd.most_common(100)).keys()))
['like', 'album', 'song', 'sound', 'music', 'one', 'feel', 'track', 'new', 'time', 'record', 'band', 'even', 'make', 'year', 'way', 'voice', 'first', 'guitar', 'work', 'moment', 'much', 'vocal', 'love', 'two', 'every', 'life', 'pop', 'take', 'also', 'still', 'come', 'back', 'world', 'beat', 'could', 'never', 'rap', 'might', 'get', 'though', 'drum', 'sings', 'artist', 'hear', 'line', 'lyric', 'best', 'something', 'last', 'around', 'melody', 'rock', 'little', 'end', 'made', 'often', 'thing', 'would', 'sense', 'long', 'go', 'many', 'yet', 'always', 'part', 'past', 'debut', 'title', 'le', 'without', 'production', 'know', 'another', 'producer', 'kind', 'word', 'single', 'day', 'find', 'piano', 'since', 'solo', 'set', 'place', 'point', 'release', 'project', 'u', 'play', 'early', 'rapper', 'together', 'feeling', 'second', 'almost', 'enough', 'good', 'people', 'say']
# create list of words to drop
words_to_keep = ['sound','feel','voice','moment','vocal','love','life','pop','take','beat','rap','hear','lyric',
'best','melody','rock','sense','debut','production','producer','piano','solo','rapper','feeling','good']
dropwords = [word for word, count in fd.most_common(100) if word not in words_to_keep]
len(dropwords)
75
Now we reproduce the frequency distribution without the 75 most frequent words we've discarded, and observe the distribution of words by ratings group.
# reviews by rating
review_8s = review_words(data_new[data_new['rating'] >= 8.0].review, dropwords)
review_7s = review_words(data_new[(data_new['rating'] >= 7.0) & (data_new['rating'] < 8.0)].review, dropwords)
review_6s = review_words(data_new[(data_new['rating'] >= 6.0) & (data_new['rating'] < 7.0)].review, dropwords)
review_5s = review_words(data_new[data_new['rating'] < 6.0].review, dropwords)
# top words by rating
word_8s = pd.DataFrame(review_8s.most_common(25), columns=['word','count'])
word_7s = pd.DataFrame(review_7s.most_common(25), columns=['word','count'])
word_6s = pd.DataFrame(review_6s.most_common(25), columns=['word','count'])
word_5s = pd.DataFrame(review_5s.most_common(25), columns=['word','count'])
fig, axes = plt.subplots(2, 2, figsize=(21, 12), layout='constrained')
# 8s
sns.barplot(ax=axes[0,0], data=word_8s, y='word',x='count')
axes[0,0].title.set_text('8.0+')
# 7s
sns.barplot(ax=axes[0,1], data=word_7s, y='word',x='count')
axes[0,1].title.set_text('7.0-7.9')
# 6s
sns.barplot(ax=axes[1,0], data=word_6s, y='word',x='count')
axes[1,0].title.set_text('6.0-6.9')
# below 6
sns.barplot(ax=axes[1,1], data=word_5s, y='word',x='count')
axes[1,1].title.set_text('below 6')
ax_ct = [(0,0),(0,1),(1,0),(1,1)]
for i in ax_ct:
axes[i].set_xlabel('')
axes[i].set_ylabel('');
The only interesting observation I glean from this data is one word that is frequently used in reviews for every group rated below an 8, but is absent from the most frequent words for albums rated 8 or higher: production. Every other category appears to contain similar words, making it challenging to quantify a distinction between categories. There do appear to be many repeating words, though. Let's take a look at the most frequently occurring trigrams to see if there are any distinctions to glean.
def trigram_extractor(column):
'''takes a dataframe column and returns a frequency distribution object of its words'''
corpus = create_corpus(column)
# tokenize the words
words = nltk.word_tokenize(corpus)
# retain only alphanumeric words
words = [w for w in words if w.isalpha()]
# remove stopwords
stopwords = nltk.corpus.stopwords.words("english")
words = [w for w in words if w.lower() not in stopwords]
# collect trigrams
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)
return finder
t_grams = trigram_extractor(data_new.review)
# 20 most common trigrams
print(tabulate(t_grams.ngram_fd.most_common(20), headers=['trigram','count']))
trigram count
------------------------------- -------
('catch', 'every', 'saturday') 1429
('every', 'saturday', 'albums') 1429
('saturday', 'albums', 'week') 1429
('albums', 'week', 'sign') 1429
('week', 'sign', 'hear') 1429
('sign', 'hear', 'newsletter') 1429
('buy', 'catch', 'every') 707
('rock', 'n', 'roll') 206
('new', 'york', 'city') 130
('drum', 'n', 'bass') 125
('songs', 'feel', 'like') 55
('often', 'feels', 'like') 49
('album', 'feels', 'like') 49
('might', 'sound', 'like') 49
('first', 'two', 'albums') 48
('nearly', 'every', 'song') 47
('feels', 'less', 'like') 46
('sounds', 'like', 'could') 44
('two', 'years', 'later') 42
('album', 'title', 'track') 39
If we remove the top 7 trigrams (the variations of "catch every saturday" newsletter boilerplate), there actually aren't many frequently occurring trigrams, considering we're looking at roughly 5300 reviews.
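Setting NLTK's collocation finder aside, the trigram counting itself is simple enough to sketch in plain Python:

```python
from collections import Counter

def trigrams(words):
    """Count consecutive word triples, like the collocation finder's ngram_fd."""
    return Counter(zip(words, words[1:], words[2:]))

# toy corpus: the repeated phrase shows up as the top trigram
words = "rock n roll will never die rock n roll".split()
print(trigrams(words).most_common(1))
```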
In this section, we perform network analysis on two sets of variables in the dataset to try to establish a relationship between the writer and genre variables, and the label and genre variables.
The idea here is that the more genres a writer or record label engages with, the more connected they will be within the entire network of Pitchfork writers or labels.
First I construct a dataframe with each observation consisting of a writer, a genre, and the number of times the writer covers the genre:
writer_genre = pd.DataFrame(pd.Series(data_new.groupby(['writer', 'genre']).size(), name="count")).reset_index()
writer_genre.head()
|  | writer | genre | count |
|---|---|---|---|
| 0 | Abby Jones | Folk/Country | 1 |
| 1 | Abby Jones | Pop/R&B / Electronic | 1 |
| 2 | Abby Jones | Rock | 34 |
| 3 | Abigail Covington | Experimental | 1 |
| 4 | Abigail Covington | Folk/Country | 1 |
Next I construct a bipartite graph object and add writers and genres as nodes. The edges are the interactions between writers and genres, and the count is added as a weight to represent the number of times the writer and genre interact.
# list of unique writers
writers = list(writer_genre.writer.unique())
# list of unique genres
genres = list(writer_genre.genre.unique())
# instantiate Graph object
G = nx.Graph()
# add bipartite edges, 0: writer, 1: genre
G.add_nodes_from(writers, bipartite=0)
G.add_nodes_from(genres, bipartite=1)
# add edges
for index, row in writer_genre.iterrows():
    G.add_edge(row['writer'], row['genre'], weight=row['count'])
# checking the graph is connected
nx.is_connected(G)
True
# checking graph is bipartite
nx.is_bipartite(G)
True
We can graph the entire network, but with 343 writers and 91 genres it's quite challenging to draw a legible bipartite graph.
plt.figure(figsize=[21,16])
options = {
"node_color": "#A0CBE2",
"edge_color": "lightgrey",
"width": 0.3,
"edge_cmap": plt.cm.Blues,
"with_labels": True,
}
top = nx.bipartite.sets(G)[0]
pos = nx.bipartite_layout(G, top)
nx.draw(G, pos, **options);
Another option, which further illustrates the challenge of creating a two-dimensional bipartite graph, is a spring layout.
seed=456 # set seed for reproducibility
plt.figure(figsize=[20,12])
# Assign colors to the top and bottom nodes
top_nodes = {n for n, d in G.nodes(data=True) if d['bipartite']==0} # writers
bottom_nodes = set(G) - top_nodes # genres
node_colors = ['lightblue' if n in top_nodes else 'lightyellow' for n in G.nodes()]
# grab the edge weights to use for edge width
weights = [edata['weight']/2 for n,e,edata in G.edges(data=True)]
# grab the degrees to use for node size
degs = [v*500 for k, v in G.degree()]
# Set up the layout for the graph
pos = nx.spring_layout(G, seed=seed)
# Draw the graph with different colors for writers and genres
nx.draw(G, pos, width=weights, node_size=degs, node_color=node_colors, with_labels=True)
plt.show();
The pyvis library offers more aesthetically pleasing options right out of the box, but it can be computationally expensive and does not render well from Jupyter notebooks on GitHub, so for this analysis, we'll stick with NetworkX.
Besides, we're not really interested in visualizing the connections between writers and genres; we're interested in how writers are connected to each other. To visualize this, we project the bipartite graph and look only at how the writers are connected. We define a function to calculate centrality measures from the graph object.
# function to calculate degree centrality measures
def graph_summary(graph):
'''
takes a networkx Graph object and returns a dataframe containing centrality measures
'''
degree = pd.Series(dict(nx.degree(graph)),name='degree')
weight = pd.Series(dict(nx.degree(graph, weight='weight')),name='weight')
closeness = pd.Series(dict(nx.closeness_centrality(graph)),name='closeness_centrality')
degree_centrality = pd.Series(dict(nx.degree_centrality(graph)),name='degree_centrality')
betweenness = pd.Series(dict(nx.betweenness_centrality(graph)),name='betweenness_centrality')
eigen_centrality = pd.Series(dict(nx.eigenvector_centrality(graph)),name='eigenvector_centrality')
return pd.concat([degree, weight, closeness, degree_centrality, betweenness, eigen_centrality], axis=1).sort_values(by='degree', ascending=False).reset_index().rename(columns={"index":"node"})
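For intuition, `weighted_projected_graph` links two writers with an edge whose weight is the number of genres both have reviewed (their shared neighbors in the bipartite graph). A toy version in plain Python, with made-up writers and genres:

```python
from itertools import combinations

# toy bipartite data: writer -> set of genres reviewed (hypothetical)
coverage = {
    "A": {"Rock", "Pop", "Jazz"},
    "B": {"Rock", "Rap"},
    "C": {"Jazz", "Rock"},
}

# project onto writers: edge weight = number of shared genres
projection = {
    (u, v): len(coverage[u] & coverage[v])
    for u, v in combinations(sorted(coverage), 2)
    if coverage[u] & coverage[v]
}
print(projection)
```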
w = bi.weighted_projected_graph(G, writers)
writers = graph_summary(w).rename(columns={'node':'writer'})
writers.sort_values(by='degree', ascending=False).head(10)
|  | writer | degree | weight | closeness_centrality | degree_centrality | betweenness_centrality | eigenvector_centrality |
|---|---|---|---|---|---|---|---|
| 0 | Matthew Ismael Ruiz | 335 | 936 | 0.979943 | 0.979532 | 0.005966 | 0.072258 |
| 2 | Ben Cardew | 334 | 1048 | 0.977143 | 0.976608 | 0.004873 | 0.072259 |
| 1 | Madison Bloom | 334 | 962 | 0.977143 | 0.976608 | 0.004904 | 0.072255 |
| 3 | Sam Goldner | 333 | 967 | 0.974359 | 0.973684 | 0.004592 | 0.072231 |
| 5 | Vrinda Jagota | 331 | 864 | 0.968839 | 0.967836 | 0.004062 | 0.072171 |
| 4 | Larry Fitzmaurice | 331 | 922 | 0.968839 | 0.967836 | 0.005292 | 0.072047 |
| 6 | Sasha Geffen | 330 | 888 | 0.966102 | 0.964912 | 0.006053 | 0.072012 |
| 7 | Evan Rytlewski | 330 | 900 | 0.966102 | 0.964912 | 0.005027 | 0.072019 |
| 8 | Jayson Greene | 329 | 923 | 0.963380 | 0.961988 | 0.004085 | 0.072004 |
| 9 | Nadine Smith | 329 | 847 | 0.963380 | 0.961988 | 0.004363 | 0.071986 |
Some explanation of the centrality measures above:
Degree refers to the number of connections a node -- in this case, a writer -- has. It measures the writer's immediate influence on the network; in other words, the total number of writers who have written about the same genres.
Weight refers to the strength of the connection between two nodes. It measures the amount of interaction between two nodes, i.e., the number of times the writers have written about the same genres.
Closeness Centrality is a measure of how close a node is to all other nodes in the network. It is the inverse of the sum of the shortest distances between the node and all other nodes.
Degree Centrality is a measure of the importance of a node based on the number of edges it has. It is the number of edges connected to a node divided by the maximum possible number of edges in the network.
Betweenness Centrality is a measure of how often a node appears on the shortest path between any two other nodes in the network. It captures the idea of gatekeeping or brokerage, where nodes with high betweenness centrality can act as bridges between different parts of the network.
Eigenvector Centrality is a measure of the importance of a node based on the connections it has with other important nodes in the network. It considers both the number and importance of the nodes that a node is connected to. The importance of a node increases if it is connected to other nodes that are themselves important.
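As a quick sanity check on these definitions, here is a minimal sketch that hand-computes each measure on a toy 4-node path graph and confirms the networkx results agree:

```python
import networkx as nx

# toy graph: a 4-node path 0 - 1 - 2 - 3
g = nx.path_graph(4)

# degree centrality: degree / (n - 1)
dc = nx.degree_centrality(g)
assert abs(dc[0] - 1/3) < 1e-12  # end node: 1 edge out of 3 possible
assert abs(dc[1] - 2/3) < 1e-12  # inner node: 2 edges

# closeness centrality: (n - 1) / sum of shortest-path distances
# node 1 distances: 1 (to 0), 1 (to 2), 2 (to 3) -> 3 / 4
cc = nx.closeness_centrality(g)
assert abs(cc[1] - 0.75) < 1e-12

# betweenness centrality (normalized): node 1 sits on the shortest paths
# for pairs (0,2) and (0,3), i.e. 2 of the 3 pairs excluding itself
bc = nx.betweenness_centrality(g)
assert abs(bc[1] - 2/3) < 1e-9

# eigenvector centrality rewards connections to well-connected nodes,
# so the inner nodes outscore the end nodes
ec = nx.eigenvector_centrality(g)
assert ec[1] > ec[0]
```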
We can visualize the graph,
seed = 123  # set seed for reproducibility
plt.figure(figsize=[20,12])
# grab the edge weights to use for edge width
weights = [edata['weight']/4 for u, v, edata in w.edges(data=True)]
# grab the degrees to use for node size
degs = [v for k, v in w.degree()]
# set up the layout for the graph
pos = nx.spring_layout(w, seed=seed)
# draw the graph, sizing nodes by degree and edges by weight
nx.draw(w, pos, width=weights, node_size=degs, with_labels=True)
plt.show();
In this graph, the size of the node represents the degree -- the number of total connections. We can see there are some almost indistinguishable-sized nodes, and some very large ones. We can use the island method to partition the graph into its largest cluster so we can identify the most influential nodes.
We define functions to iterate through the graph and return a table representing the modularity of the network -- how many clusters the network can be partitioned into while remaining connected. Then we can select the desired threshold with which to reduce the network, and create a new graph object including only the nodes above the selected threshold.
def trim_edges(g, weight=1):
    """
    Takes a graph and applies a threshold ("water level"),
    letting all edges above a certain value through and removing all others.
    It returns a copy of the original graph, so it's non-destructive.
    """
    g2 = nx.Graph()
    for f, to, edata in g.edges(data=True):
        if edata['weight'] > weight:
            g2.add_edge(f, to, **edata)
    return g2

def island_method(g, iterations=5):
    weights = [edata['weight'] for f, to, edata in g.edges(data=True)]
    mn = int(min(weights))
    mx = int(max(weights))
    # compute the step size, guarding against a zero step when the weight range is narrow
    step = max(int((mx - mn) / iterations), 1)
    return [[threshold, trim_edges(g, threshold)] for threshold in range(mn, mx, step)]
islands = island_method(w)
print('threshold','\t\t','size','\t\t','conn. components')
for i in islands:
    # print the threshold level, size of the graph, and number of connected components
    print(i[0],'\t\t\t', len(i[1]), '\t\t\t', len(list(nx.connected_components(i[1]))))
threshold    size    conn. components
1            234     1
3            132     1
5            73      1
7            26      1
9            6       1
11           3       1
We set the threshold to 7 and visualize the top 26 writers,
w_trim = trim_edges(w, weight=7)
plt.figure(figsize=[20,12])
# grab the edge weights to use for edge width
weights = [edata['weight']/8 for u, v, edata in w_trim.edges(data=True)]
# grab the degrees to use for node size
degs = [v*500 for k, v in w_trim.degree()]
# set up the layout for the graph
pos = nx.spring_layout(w_trim)
# draw the graph, sizing nodes by degree and edges by weight
nx.draw(w_trim, pos, width=weights, node_size=degs, node_color='lightblue', with_labels=True)
plt.show();
Now we can clearly visualize the most connected writers. Again, this sort of visualization is far more effective with three-dimensional, interactive functionality.
Next, we combine our centrality measures with our sentiment analysis measures to create a master dataset that we can use to observe any relationships that may exist amongst variables.
writers_master = pd.merge(pd.merge(writer_scores, writers_df, on='writer'), writers, on='writer')
writers_master.head()
| | writer | avg_score | total_reviews | avg_rating | genres | pct_total | degree | weight | closeness_centrality | degree_centrality | betweenness_centrality | eigenvector_centrality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abby Jones | 0.1463 | 36 | 6.80 | 3 | 0.0068 | 208 | 289 | 0.716981 | 0.608187 | 0.000189 | 0.052622 |
| 1 | Abigail Covington | 0.2058 | 10 | 7.21 | 3 | 0.0019 | 236 | 401 | 0.763393 | 0.690058 | 0.000587 | 0.057819 |
| 2 | Adlan Jackson | 0.1093 | 14 | 7.10 | 5 | 0.0026 | 308 | 592 | 0.909574 | 0.900585 | 0.002635 | 0.069531 |
| 3 | Aimee Cliff | 0.1400 | 23 | 7.15 | 6 | 0.0043 | 327 | 808 | 0.957983 | 0.956140 | 0.003548 | 0.071947 |
| 4 | Alex Ramos | 0.1470 | 1 | 6.50 | 1 | 0.0002 | 163 | 163 | 0.653920 | 0.476608 | 0.000000 | 0.040817 |
Now let's take a look at the labels. First we construct a dataframe with each observation containing three columns: a label, a genre, and the number of times the genre occurs. Then we'll instantiate a Graph object and build the bipartite graph with labels and genres as nodes and the number of times they occur as the edge weights between them.
labels_genre.head()
| | label | genre | count |
|---|---|---|---|
| 0 | 1504190 Records | Rap | 1 |
| 1 | August Greene, LLC | Pop/R&B | 1 |
| 2 | Cohn Corporation | Rap | 1 |
| 3 | Daydream Library Series | Jazz / Pop/R&B / Rock | 1 |
| 4 | Daydream Library Series | Rock | 3 |
labels = list(labels_genre.label.unique()) # list of labels
genres = list(labels_genre.genre.unique()) # list of genres
G = nx.Graph()
G.add_nodes_from(labels, bipartite=0) # add nodes assigning a bipartite level to the two types of nodes
G.add_nodes_from(genres, bipartite=1)
for index, row in labels_genre.iterrows():  # iterate through the dataframe adding the edges and weights
    G.add_edge(row['label'], row['genre'], weight=row['count'])
# is the graph connected?
nx.is_connected(G)
False
nx.is_bipartite(G)
True
Unlike the writers graph, the labels graph is not connected: some labels are not connected to any others. Part of the reason, as we saw earlier, is that 82% of all labels have reviews in only one genre. In order to identify the most connected labels, we'll need to find the largest connected component of the network.
# isolate largest connected component
largest_cc = max(nx.connected_components(G), key=len)
# project the largest connected component subgraph onto a new Graph object: S
S = G.subgraph(largest_cc)
# check the subgraph is connected
nx.is_connected(S)
True
plt.figure(figsize=[21,16])
#colors = range(200)
options = {
    "node_color": "#A0CBE2",
    "edge_color": "lightgrey",
    "width": 0.3,
    "edge_cmap": plt.cm.Blues,
    "with_labels": True,
}
top = nx.bipartite.sets(S)[0]
pos = nx.bipartite_layout(S, top)
nx.draw(S, pos, **options);
The full graph is still impossible to make sense of two-dimensionally, although we can see what look like the four most frequent genres (Rap, Pop/R&B, Electronic, and Rock),
print('Top 5 most frequent genres: \n')
print(tabulate(labels_genre.groupby('genre')['count'].sum().reset_index().sort_values(by='count', ascending=False).head(5),headers="keys", tablefmt="github", showindex=False, numalign="right"))
Top 5 most frequent genres:

| genre        |   count |
|--------------|---------|
| Rock         |    1740 |
| Rap          |     950 |
| Pop/R&B      |     698 |
| Electronic   |     692 |
| Experimental |     500 |
But there are some problems with this: it's not really clear from the graph that Rock is the most frequent genre by far, and there is some ambiguity around the Electronic and Experimental nodes that blurs the distinction between the two. So again, we'll project the bipartite graph to look just at the labels that have the most interactions, using the island method:
# list of labels in subgraph
labels_b = [x for x,y in S.nodes(data=True) if y['bipartite']==0]
# project the bipartite graph
l = bi.weighted_projected_graph(S, labels_b)
# island method
# island method
islands = island_method(l)
print('threshold','\t\t','size','\t\t','conn. components')
for i in islands:
    # print the threshold level, size of the graph, and number of connected components
    print(i[0],'\t\t\t', len(i[1]), '\t\t\t', len(list(nx.connected_components(i[1]))))
threshold    size    conn. components
1            349     1
2            153     1
3            86      1
4            47      1
5            20      1
6            13      1
7            5       1
8            2       1
9            2       1
We'll look at the 20 most connected nodes by setting the threshold to 5 and plot the graph.
# trim the graph
l_trim = trim_edges(l, 5)
# plot the graph
plt.figure(figsize=[20,12])
# grab the edge weights to use for edge width
weights = [edata['weight']/8 for u, v, edata in l_trim.edges(data=True)]
# grab the degrees to use for node size
degs = [v*500 for k, v in l_trim.degree()]
# set up the layout for the graph
pos = nx.spring_layout(l_trim)
# draw the graph, sizing nodes by degree and edges by weight
nx.draw(l_trim, pos, width=weights, node_size=degs, node_color='lightyellow', with_labels=True)
plt.show();
We see self-released albums at the center of the network, which makes sense as a reference point for interaction by genre, since self-released albums are by nature not bound to a single label. The other labels in the subgroup are widely considered to be highly influential.
In order to produce centrality measures for the network and observe them against the rest of the dataset, we need to remove the observations from the dataset that are not connected for our modeling.
not_connected = len(labels_df.loc[~labels_df['label'].isin(labels_b)])
print('{} labels not connected. They are: \n'.format(not_connected))
print(tabulate(labels_df[['label','total_reviews']].loc[~labels_df['label'].isin(labels_b)], headers="keys", showindex=False, tablefmt="github"))
11 labels not connected. They are:

| label                              |   total_reviews |
|------------------------------------|-----------------|
| Monster Boy                        |               1 |
| Mavin / Jonzing World / Virgin     |               1 |
| Season of Mist                     |               1 |
| New Amsterdam / figureight         |               1 |
| Impatience                         |               1 |
| Trojan Jamaica / BMG               |               1 |
| Conmatizque                        |               1 |
| Sector 7-G                         |               1 |
| Youth Sounds / Cadiz Entertainment |               1 |
| Metallic Music/1789                |               1 |
| New Deal / Impulse!                |               1 |
label_master = graph_summary(l).rename(columns={'node':'label'})
label_master.sort_values(by='degree', ascending=False).head(10)
| | label | degree | weight | closeness_centrality | degree_centrality | betweenness_centrality | eigenvector_centrality |
|---|---|---|---|---|---|---|---|
| 0 | self-released | 2013 | 2619 | 0.962759 | 0.961318 | 0.037651 | 0.055082 |
| 1 | Columbia | 1926 | 2400 | 0.925729 | 0.919771 | 0.019705 | 0.054968 |
| 2 | XL | 1861 | 2266 | 0.899871 | 0.888730 | 0.018314 | 0.054410 |
| 3 | Warp | 1798 | 2160 | 0.876151 | 0.858644 | 0.014295 | 0.054193 |
| 4 | Ghostly International | 1772 | 2099 | 0.866722 | 0.846227 | 0.012349 | 0.054128 |
| 5 | Loma Vista | 1749 | 2089 | 0.858197 | 0.835244 | 0.015004 | 0.053283 |
| 6 | Sub Pop | 1715 | 2070 | 0.846403 | 0.819007 | 0.014870 | 0.053063 |
| 7 | Island | 1696 | 2014 | 0.839952 | 0.809933 | 0.014477 | 0.053003 |
| 8 | BMG | 1651 | 1925 | 0.825059 | 0.788443 | 0.019116 | 0.052469 |
| 9 | Leaving | 1635 | 1893 | 0.820212 | 0.780802 | 0.015001 | 0.050666 |
# subset only connected labels
labels_sub = labels_df.loc[labels_df['label'].isin(labels_b)]
labels_sub.head()
| | label | total_reviews | total_genres | avg_rating |
|---|---|---|---|---|
| 0 | self-released | 175 | 24 | 7.26 |
| 1 | Domino | 70 | 8 | 7.39 |
| 2 | Sub Pop | 58 | 9 | 7.17 |
| 3 | Merge | 58 | 9 | 7.43 |
| 4 | Columbia | 58 | 9 | 6.77 |
# average score by label
label_score = data_new[data_new['label'].isin(labels_b)].groupby('label')['score'].mean().reset_index().rename(columns={'score':'avg_score'})
# combine dataframes
label_m1 = pd.merge(pd.merge(labels_sub[['label','avg_rating']],label_score, on='label'), label_master, on='label')
label_m1.sort_values(by='degree', ascending=False).head(10)
| | label | avg_rating | avg_score | degree | weight | closeness_centrality | degree_centrality | betweenness_centrality | eigenvector_centrality |
|---|---|---|---|---|---|---|---|---|---|
| 0 | self-released | 7.26 | 0.138531 | 2013 | 2619 | 0.962759 | 0.961318 | 0.037651 | 0.055082 |
| 4 | Columbia | 6.77 | 0.125328 | 1926 | 2400 | 0.925729 | 0.919771 | 0.019705 | 0.054968 |
| 23 | XL | 7.62 | 0.156536 | 1861 | 2266 | 0.899871 | 0.888730 | 0.018314 | 0.054410 |
| 13 | Warp | 7.64 | 0.138167 | 1798 | 2160 | 0.876151 | 0.858644 | 0.014295 | 0.054193 |
| 75 | Ghostly International | 7.40 | 0.217250 | 1772 | 2099 | 0.866722 | 0.846227 | 0.012349 | 0.054128 |
| 20 | Loma Vista | 7.01 | 0.108067 | 1749 | 2089 | 0.858197 | 0.835244 | 0.015004 | 0.053283 |
| 2 | Sub Pop | 7.17 | 0.180828 | 1715 | 2070 | 0.846403 | 0.819007 | 0.014870 | 0.053063 |
| 38 | Island | 6.24 | 0.131450 | 1696 | 2014 | 0.839952 | 0.809933 | 0.014477 | 0.053003 |
| 34 | BMG | 6.63 | 0.151609 | 1651 | 1925 | 0.825059 | 0.788443 | 0.019116 | 0.052469 |
| 47 | Leaving | 7.24 | 0.203444 | 1635 | 1893 | 0.820212 | 0.780802 | 0.015001 | 0.050666 |
Now we can attempt to model our data and observe any relationship between review sentiments, network centrality, and ratings.
In this section, we'll take a look at the writer and label centrality measures along with the sentiment scores produced for each review, and observe if any relationship exists between the variables sufficient to employ a regression model that can be used to predict ratings.
First, we'll map the centrality measures to the first dataset. Then we'll visualize scatter plots and construct a correlation matrix to determine if there are any observable patterns or correlations amongst the variables.
# subset the data_new dataframe to the largest connected component of the label network
# retain only columns: artist, album, label, writer, rating, score
cols = ['artist','album','label','writer','rating','score']
model_df = data_new.loc[data_new['label'].isin(labels_b)][cols]
# create dictionaries of the centrality measure columns to map to model dataframe
# writers
w_close = dict(zip(writers.writer, writers.closeness_centrality))
w_degree = dict(zip(writers.writer, writers.degree_centrality))
w_bet = dict(zip(writers.writer, writers.betweenness_centrality))
w_eig = dict(zip(writers.writer, writers.eigenvector_centrality))
# labels
l_close = dict(zip(label_master.label, label_master.closeness_centrality))
l_degree = dict(zip(label_master.label, label_master.degree_centrality))
l_bet = dict(zip(label_master.label, label_master.betweenness_centrality))
l_eig = dict(zip(label_master.label, label_master.eigenvector_centrality))
# map dictionaries to new columns in model dataframe
# add centrality measure columns
# writers
model_df['w_close'] = model_df['writer'].map(w_close)
model_df['w_degree'] = model_df['writer'].map(w_degree)
model_df['w_between'] = model_df['writer'].map(w_bet)
model_df['w_eigen'] = model_df['writer'].map(w_eig)
# labels
model_df['l_close'] = model_df['label'].map(l_close)
model_df['l_degree'] = model_df['label'].map(l_degree)
model_df['l_between'] = model_df['label'].map(l_bet)
model_df['l_eigen'] = model_df['label'].map(l_eig)
model_df.head()
| | artist | album | label | writer | rating | score | w_close | w_degree | w_between | w_eigen | l_close | l_degree | l_between | l_eigen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The National | First Two Pages of Frankenstein | 4AD | Brad Shoup | 6.6 | 0.143 | 0.716981 | 0.608187 | 0.000189 | 0.052622 | 0.755139 | 0.677650 | 0.007824 | 0.046593 |
| 1 | Bill Evans | Treasures: Solo, Trio and Orchestra Recordings... | Elemental Music | Mark Richardson | 8.1 | 0.377 | 0.797203 | 0.745614 | 0.001042 | 0.060945 | 0.505797 | 0.047755 | 0.000000 | 0.002252 |
| 2 | Remi Kabaka | Son of Africa | BBE Music | Dean Van Nguyen | 7.4 | 0.246 | 0.936986 | 0.935673 | 0.003821 | 0.070861 | 0.689951 | 0.551576 | 0.004997 | 0.044517 |
| 3 | Portrayal of Guilt | Devil Music | Run for Cover | Marty Sartini Garner | 7.4 | -0.157 | 0.800937 | 0.754386 | 0.001288 | 0.061330 | 0.656015 | 0.478032 | 0.001923 | 0.040450 |
| 4 | Jim Legxacy | HNPM | (!) | Dylan Green | 7.8 | 0.266 | 0.738661 | 0.649123 | 0.001646 | 0.050636 | 0.361908 | 0.001433 | 0.000000 | 0.000019 |
Now we can visualize some scatter plots with ratings as the y (dependent) variable,
def scatter_figs(df, title):
    '''takes model dataframe and returns scatter plots'''
    fig = make_subplots(rows=3, cols=3, shared_yaxes=True)
    fig.add_trace(go.Scatter(x=df.score, y=df.rating, mode='markers', name="score"), row=1, col=1)
    fig.add_trace(go.Scatter(x=df.w_close, y=df.rating, mode='markers', name="writer closeness"), row=1, col=2)
    fig.add_trace(go.Scatter(x=df.w_degree, y=df.rating, mode='markers', name="writer degree"), row=1, col=3)
    fig.add_trace(go.Scatter(x=df.w_between, y=df.rating, mode='markers', name="writer between"), row=2, col=1)
    fig.add_trace(go.Scatter(x=df.w_eigen, y=df.rating, mode='markers', name="writer eigen"), row=2, col=2)
    fig.add_trace(go.Scatter(x=df.l_close, y=df.rating, mode='markers', name="label closeness"), row=2, col=3)
    fig.add_trace(go.Scatter(x=df.l_degree, y=df.rating, mode='markers', name="label degree"), row=3, col=1)
    fig.add_trace(go.Scatter(x=df.l_between, y=df.rating, mode='markers', name="label between"), row=3, col=2)
    fig.add_trace(go.Scatter(x=df.l_eigen, y=df.rating, mode='markers', name="label eigen"), row=3, col=3)
    fig.update_layout(height=600, width=850, title_text=title)
    return fig.show("svg")
scatter_figs(model_df, 'Rating as a Function of Predictors')
There does not appear to be a relationship between rating and the predictors we've identified -- we've already noted the inconsistency of the sentiment score. For a few of the variables (specifically the betweenness and eigenvector measures), there may be distinct levels or categories, suggesting an interaction term may be appropriate in a regression model, but the absence of any clear correlation is concerning. We can produce a correlation matrix to determine whether there is in fact any correlation amongst the predictors and the response.
plt.figure(figsize=[12,8])
corr_matrix = model_df.iloc[:,4:].corr()
sns.heatmap(corr_matrix, annot=True, cmap="YlGnBu")
plt.show();
We observe high collinearity amongst the centrality measures, and nearly zero correlation between each predictor and the response, eliminating the possibility of producing a predictive model from the VADER sentiment scores and centrality measures based on diversity of genres. For a run-through of what a predictive model using the compiled data returns, please see Appendix C.
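The collinearity visible in the heatmap can also be quantified with variance inflation factors (the `variance_inflation_factor` helper is imported for this report); a conventional rule of thumb flags VIFs above roughly 10. A minimal sketch on synthetic data, since the real predictor columns would drop in the same way:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = x1 * 0.95 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                          # independent

X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
X.insert(0, 'const', 1.0)  # VIF expects an intercept column

# VIF of each column against all the others
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns)}
# x1 and x2 get large VIFs; x3 stays near 1
```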
## ALL PACKAGES USED FOR THIS REPORT -- RUN THIS CELL FIRST TO REPRODUCE CODE ##
import pandas as pd
import numpy as np
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.collocations import TrigramCollocationFinder
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import ngrams, bigrams, trigrams
import networkx as nx
from networkx.algorithms import bipartite as bi
from tabulate import tabulate
import re
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from statsmodels.stats.outliers_influence import variance_inflation_factor
pio.renderers.default = "svg"
%matplotlib inline
# loading the full dataset
data = pd.read_csv('data.csv').dropna()
# remove bracketed and parenthesized text
data['review'] = [re.sub(r'\(.*?\)|\[.*?\]', '', x) for x in data['review']]
# remove asterisks
data['review'] = [re.sub(r"\*", " ", x) for x in data['review']]
# remove escaped double quotes
data['review'] = [x.replace('\\"', ' ') for x in data['review']]
# remove backslashes
data['review'] = [re.sub(r"\\", ' ', x) for x in data['review']]
# white space after periods
data['review'] = [x.replace(".", '. ').replace(" .", ".").replace("  ", " ").replace("\"", '').replace(",,", ",").replace(" ,", ",") for x in data['review']]
# rename score column: rating
data.rename(columns={'author':'writer','score':'rating'}, inplace=True)
# releaseYear column as string
data['releaseYear'] = [str(x).replace(".0", "") for x in data['releaseYear']]
## FILTERING OUT REISSUES AND NEW ALBUMS ##
# list of possible words denoting reissue
reissues = ['deluxe','reissue','anniversary', 'Anniversary)','edition']
# filter out reissue reviews - only new reviews
data_new = data[data['releaseYear'] >= '2017']
data_new = data_new[data_new['bestNewReissue'] != True]
data_new = data_new[~data_new['album'].str.lower().isin(reissues)]
data_new = data_new[data_new['artist'] != 'King Crimson']
# isolate the reissues
data_reissues = data[(data['releaseYear'] < '2017') | (data['album'].str.lower().isin(reissues)) | (data['artist'] == 'King Crimson')]
## EXPLORING RATINGS OVER TIME ##
# daily ratings
daily_ratings = data_new[['published','rating']]
daily_ratings['published'] = pd.to_datetime(daily_ratings['published'])
daily_ratings = daily_ratings.groupby('published')['rating'].mean().round(1).reset_index().rename(columns={'rating':'avg_rating'})
# weekly ratings
weekly_ratings = daily_ratings.groupby([pd.Grouper(key='published', freq='W')])['avg_rating'].mean().round(1).reset_index()
# monthly ratings
monthly_ratings = daily_ratings.groupby([pd.Grouper(key='published', freq='M')])['avg_rating'].mean().round(1).reset_index()
## WRITERS ##
# total number of reviews
writers_reviews = data_new.writer.value_counts().reset_index().rename(columns={'index':'writer','writer':'total_reviews'})
# avg rating
writers_ratings = data_new.groupby('writer')['rating'].mean().round(2).reset_index().rename(columns={'index':'writer','rating':'avg_rating'})
# total number of genres
writers_genres = data_new.groupby('writer')['genre'].nunique().reset_index().rename(columns={'genre':'genres'})
# combine datasets
writers_df = pd.merge(pd.merge(writers_reviews, writers_ratings, on='writer'), writers_genres, on='writer')
# percentage of total reviews
writers_df['pct_total'] = writers_df['total_reviews'].agg(lambda x: x/len(data_new)).round(4)
## LABELS ##
# genres by label
labels_genre = pd.DataFrame(pd.Series(data.groupby(['label', 'genre']).size(), name="count")).reset_index()
# number of review by label
label_reviews = pd.Series(data_new.label.value_counts(ascending=False), name='total_reviews').reset_index().rename(columns={'index':'label'})
# average rating per label
label_ratings = data_new.groupby('label')['rating'].mean().round(2).reset_index().rename(columns={'rating':'avg_rating',})
# total genres by label
total_genres = labels_genre.label.value_counts().reset_index().rename(columns={'index':'label','label':'total_genres'})
# master label
labels_df = pd.merge(pd.merge(label_reviews, total_genres, on='label'), label_ratings, on='label')
# labels with more than ten reviews
top_reviews = labels_df[labels_df['total_reviews'] >= 10].sort_values(by='avg_rating',ascending=False)
# split the model into training and testing sets: 80/20 split
train, test = train_test_split(model_df.iloc[:,4:], test_size=0.2, random_state=42)
X_train = train.drop('rating', axis=1) # predictors
y_train = train['rating'] # response variable
# add constant to predictors
X_train = sm.add_constant(X_train)
# create the regression model
model = sm.OLS(y_train, X_train).fit()
# print the model summary
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: rating R-squared: 0.014
Model: OLS Adj. R-squared: 0.012
Method: Least Squares F-statistic: 6.868
Date: Tue, 16 May 2023 Prob (F-statistic): 7.14e-10
Time: 12:59:02 Log-Likelihood: -5141.7
No. Observations: 4232 AIC: 1.030e+04
Df Residuals: 4222 BIC: 1.037e+04
Df Model: 9
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 6.1131 0.765 7.990 0.000 4.613 7.613
score 0.1342 0.099 1.349 0.177 -0.061 0.329
w_close 4.8105 1.268 3.794 0.000 2.324 7.296
w_degree -8.1150 1.416 -5.729 0.000 -10.892 -5.338
w_between 14.4671 17.700 0.817 0.414 -20.234 49.168
w_eigen 72.6310 11.785 6.163 0.000 49.526 95.736
l_close -2.1367 1.269 -1.684 0.092 -4.625 0.351
l_degree 0.7408 0.533 1.390 0.165 -0.304 1.786
l_between 11.9358 5.489 2.174 0.030 1.174 22.698
l_eigen -3.2812 2.088 -1.572 0.116 -7.374 0.812
==============================================================================
Omnibus: 1222.847 Durbin-Watson: 2.020
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4344.028
Skew: -1.419 Prob(JB): 0.00
Kurtosis: 7.072 Cond. No. 2.62e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.62e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
We can see there is an extremely low R-squared and multicollinearity amongst the predictors. We use the SequentialFeatureSelector to identify the predictors with the most significance to the model:
X_train = train.drop('rating', axis=1)
y_train = train['rating']
# instantiate sequential feature selector
selector = SequentialFeatureSelector(estimator=LinearRegression(),
n_features_to_select=3,
direction='forward',
scoring='r2')
# fit the selector to the data
selector.fit(X_train, y_train)
# view selected features
selected_features = X_train.columns[selector.get_support()]
print(selected_features)
Index(['w_between', 'w_eigen', 'l_degree'], dtype='object')
# fit the model using the selected features as predictors
X_train = train[selected_features] # predictors
y_train = train['rating'] # response variable
# add constant to predictors
X_train = sm.add_constant(X_train)
# create the regression model
model = sm.OLS(y_train, X_train).fit()
# print the model summary
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: rating R-squared: 0.004
Model: OLS Adj. R-squared: 0.004
Method: Least Squares F-statistic: 5.963
Date: Tue, 16 May 2023 Prob (F-statistic): 0.000471
Time: 13:01:58 Log-Likelihood: -5163.5
No. Observations: 4232 AIC: 1.033e+04
Df Residuals: 4228 BIC: 1.036e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 7.1388 0.095 75.018 0.000 6.952 7.325
w_between -32.7477 10.663 -3.071 0.002 -53.652 -11.844
w_eigen 2.8215 1.721 1.640 0.101 -0.552 6.195
l_degree -0.1456 0.052 -2.792 0.005 -0.248 -0.043
==============================================================================
Omnibus: 1260.309 Durbin-Watson: 2.016
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4531.775
Skew: -1.461 Prob(JB): 0.00
Kurtosis: 7.143 Cond. No. 920.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We can see the R-squared has actually gone down, and the writer eigenvector centrality predictor is not significant to the model.